Indonesian Journal of Electrical Engineering and Computer Science 
Vol. 27, No. 3, September 2022, pp. 1689~1697 
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v27.i3.pp1689-1697


Alzheimer’s disease prediction using three machine learning 
methods 


Shaymaa Taha Ahmed¹, Suhad Malallah Kadhem²


¹Department of Computer Science, Faculty of Basic Education, University of Diyala, Diyala, Iraq
²Department of Computer Science, University of Technology, Baghdad, Iraq


Article Info

Article history:
Received Jun 1, 2022
Revised Jun 16, 2022
Accepted Jul 4, 2022

Keywords:
Alzheimer’s disease
Gene expression
Information gain
Microarray technology
Support vector machine

ABSTRACT

Alzheimer's disease (AD) is the most common incurable neurodegenerative illness, a term that
encompasses the loss of memory as well as other cognitive abilities. Using early-stage gene
expression data from blood, drawn from a clinical Alzheimer's dataset, the goal of this study was to
construct a classification model that can predict the early stages of Alzheimer's disease. Using
information gain (IG), a subset of features was selected that provides substantial information for
distinguishing between normal control (NC) and early-stage AD participants. The data was divided
into splits of various sizes, and three distinct machine learning (ML) algorithms were used to
generate the classification models: support vector machine (SVM), Naive Bayes (NB), and k-nearest
neighbors (K-NN). Using the WEKA software tool and a variety of model performance measures, the
capacity of the algorithms to effectively predict cognitive impairment status was compared and
tested. The current findings reveal that an SVM-based classification model can accurately
differentiate cognitively impaired Alzheimer's patients from normal healthy people with 96.6%
accuracy. We discovered and validated a gene expression pattern in the blood that accurately
distinguishes Alzheimer's patients from cognitively healthy controls, demonstrating that changes
specific to AD can be detected far from the disease's core site.


This is an open access article under the CC BY-SA license. 


Corresponding Author: 


Shaymaa Taha Ahmed 

Department of Computer Science, Faculty of Basic Education, University of Diyala 
Diyala, Iraq 

Email: mrs.sh.ta.ah@gmail.com


1. INTRODUCTION 

Alzheimer's disease (AD) dementia affects 40-50 million individuals worldwide, with the number
more than doubling between 1990 and 2016 [1]. AD is by far the most common type of
dementia and is anticipated to become more widespread as the population ages. The costs are rising in
tandem with the growth in its occurrence. Alzheimer's disease is estimated to have cost the globe $604 billion
in 2010 [2]. By 2030, Alzheimer's disease is predicted to cost $2 trillion in healthcare worldwide, affecting
more than 131 million individuals. As a result, AD is quickly becoming a major global health and economic
issue, prompting intensive scientific research to identify underlying genetic risk factors and regulatory
markers, as well as to reduce the estimated healthcare burden through early detection, particularly at
presymptomatic stages [3], [4]. The late-onset hallmarks of AD are the subject of much research;
neurofibrillary tangles, amyloid plaques, and neuronal tangles are examples [5], [6]. Although these
discoveries have diagnostic relevance, the overall therapeutic contributions of these late-onset
Alzheimer's disease characteristics are unclear [7], [8]. Furthermore, clinical trials demonstrate that
patients with Alzheimer's disease have a wide range of symptoms and respond differently to
treatments, implying that there are numerous biological origins for the disease. This complicates the
investigation of AD even further [9], [10].

Journal homepage: http://ijeecs.iaescore.com

Data obtained by high-throughput gene expression profiling has provided new paths for a better
understanding of complicated disease mechanisms and pathways at the molecular level in recent years [11].
However, identifying embedded patterns in high-throughput gene expression data is difficult due to the high
dimensionality, small sample size, and noise. In the context of gene expression profile dataset analysis, the
methods for identifying the most explanatory gene subsets through data reduction and feature selection are
divided into two categories [12]: i) marginal filtering methods and ii) wrapper (embedded) methods
[13], [14]. Marginal filtering is either univariate or multivariate. The paired t-test (TS), information
gain (IG), and Pearson correlation coefficient (PCC) are examples of univariate filtering procedures
[15]-[17]. If there are too many features, overfitting issues arise, and if there are too few, key features
are missed [18]. As a result, feature selection is a critical component of modeling [19], [20]. Although a
large feature set may sound ideal for developing a predictive and robust model, it is difficult to select
the highly relevant features in the gene expression dataset, which has 16,382 features. In such
circumstances, feature selection and dimensionality reduction algorithms aid in identifying the core
features that have a significant impact on result prediction. The information gain, chi-squared test, and
mean decrease Gini are among the statistical tests utilized in this study to find such gene
expressions [21], [22].

In this study, we examined the importance of feature selection in the detection of Alzheimer's disease
and identified a suitable selection approach that can better predict the disease. The information gain
(IG) is based on three biomarkers: MRI, PET, and CSF, all of which are recommended by the National
Institute of Neurological and Communicative Disorders and Stroke and the Alzheimer's Disease and Related
Disorders Association [23], [24]. To improve the accuracy of the classification, a feature selection method
was applied, and a subset of the top 44 ranked features was constructed. Then, using the selected features, a
support vector machine (SVM) based classifier is built to distinguish AD from healthy controls
(HCs). The predictor's performance is evaluated using 5-fold cross-validation. The experimental findings
of this study reveal the efficacy of the proposed feature selection strategy in the diagnosis of
Alzheimer's disease.


2. RELATED WORKS 

Researchers from a wide range of fields are interested in microarray data analysis. Owing to recent
advancements in biomedical and information technology, several research projects have been carried out,
resulting in several algorithms that are useful for AD prediction. This section therefore reviews some of
the most recent proposals for microarray data analysis and AD prediction from artificial intelligence,
machine learning, data mining, and related fields.

Meng et al. [25] used four types of machine learning models, namely support vector machine
(SVM), Naive Bayes (NB), random forest (RF), and multilayer perceptron neural network (MLP-NN), to
analyze the gene expression of AD patients and normal people. In this experiment they used a dataset from
the Gene Expression Omnibus (GEO: GSE1297) maintained by the National Center for Biotechnology Information
(NCBI). The statistical t-test with a significance level of p-value <0.05 is used for gene selection to
select the best gene subset. The results indicated the accuracy of the above models as follows: (87.10),
(90.32), and (97.66) respectively. Among them, the MLP-NN model performs better than the other models in
distinguishing AD from normal gene expression, proving its efficacy.

Scheubert et al. [26] presented a classification system for predicting AD from the dataset GSE5281,
which is referred to as the AD dataset. A wrapper of genetic algorithm and support vector machine 
(GA/SVM) is used as a feature selection method to select a subset of relevant genes that improves the 
performance of classification. Six different classification methods: Naive Bayes (NB), C4.5 (decision tree), 
k-nearest neighbor (KNN), random forest (RF), SVM with Gaussian kernel and SVM with linear kernel have 
been used. The results indicated the accuracy of the above models as follows: (81.4), (78.9), (87.0), (87.0), 
(85.7) and (91.9) respectively. 

Huang et al. [27] presented a classification system for predicting AD from the datasets GSE63060
and GSE63061, which are referred to as the AD dataset. Analysis of variance (ANOVA) and mutual
information (MI) are used as feature selection methods to select a subset of relevant genes that improves the
performance of classification. Two classification methods have been used: the k-means algorithm and a
convolutional neural network (CNN). The results indicated the accuracy of the above models as follows:
0.886 and 0.929 respectively. A fundamental challenge in deep learning is determining the network design
that provides the best prediction accuracy. This process involves choosing network hyperparameters,
including the number of layers, transformation types, and training parameters.

Eke et al. [28] presented a classification system for AD using two gene expression datasets, namely
GSE63060 and GSE63061. These two datasets were merged. The least absolute shrinkage and selection 
operator (LASSO) feature selection method is used to detect the optimal subset. The classification models: 
support vector machine (SVM), random forest (RF) and logistic ridge regression (RR) have proved predictive 
in distinguishing between cognitively normal (CN), mild cognitive impairment (MCI), and subjects with AD. 
SVM, RF, and RR classification models achieved accuracy (0.773), (0.785), and (0.765) respectively. 

Niyas and Thiyagarajan [29] reported on the diagnosis of Alzheimer's disease using machine
learning techniques. Bermany and Rashid [30] used a supervised support vector machine (SVM)
model to classify images. Jha and Kwon [31] proposed a cluster analysis technique for AD diagnosis from
magnetic resonance brain images based on the decision tree. Bi et al. [32] proposed the random
neural network cluster to improve classification performance. The dataset for their experimentation was
selected from the Alzheimer’s disease neuroimaging initiative (ADNI). The Elman neural network was
confirmed to be an ideal base classifier for the random neural network cluster, which depends on the
outcome of feature selection, with 92.31% accuracy.

Tanveer et al. [32] used the random support vector machine clustering approach to diagnose AD
and to group the disease regions into inferior frontal gyrus, superior frontal gyrus, precentral gyrus, and
cingulate cortex. Voyle et al. [33] developed multiple models for the classification of AD using various
machine learning techniques, namely multilayer perceptron, bagging, decision tree, coactive neuro-fuzzy
inference system (CANFIS) and genetic algorithm. The classification precision of the CANFIS method was
99.55%. Plant et al. [34] developed a hybrid model using SVM and a Bayesian classifier to detect brain
atrophy patterns as well as to predict AD. A pattern matching index of 92% was obtained for their method.
Vega et al. [35] developed a new methodology for the classification of MR brain images into normal and
AD-affected images. Their method is based on MRI feature extraction in the wavelet domain followed by
dimensionality reduction and SVM classification.

Li et al. [36] proposed a strategy based on principal component analysis (PCA). It
utilizes models of continuous selection and dropout as well as the restricted Boltzmann machine
(RBM), which is a deep learning method. Ma et al. [37] created a technique to analyze AD from medical
images using ML-based multimodal data fusion.


3. MATERIALS AND METHODS 

This section describes the dataset, the techniques for preparing the data, and the machine learning
algorithms used to build the classification model. The section also includes a description of WEKA's
statistical model performance evaluator, which can be used to analyze and compare the robustness and
dependability of the created models. The overall workflow is shown in Figure 1.

Figure 1. An instance of the steps involved in creating a classification model for predicting AD
in its early stages


3.1. Dataset

There are numerous biological datasets available. The dataset for this paper is collected from the
Gene Expression Omnibus (GEO), which is a publicly available data source. The National Center for
Biotechnology Information (NCBI) released it on August 5, 2015. The dataset’s accession numbers are
GSE63060 and GSE63061, as provided by the AddNeuroMed cohort. To increase the number of
samples, these two datasets were merged into one, the AD dataset. Gene expression levels in the AD
dataset were measured using microarray technology. It has 16,382 columns (genes) and 569 rows
(samples). It is made up of 245 patients with AD, 142 with mild cognitive impairment (MCI), and 182
controls (CTL).
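
The paper does not state how the two GEO series were merged. One plausible reading, sketched below purely as an assumption, is to keep only the genes measured in both series and stack the samples. The sample identifiers, gene names, and expression values are invented for illustration.

```python
def common_genes(*datasets):
    """Return the sorted genes measured in every sample of every dataset."""
    shared = None
    for dataset in datasets:
        for profile in dataset.values():
            genes = set(profile)
            shared = genes if shared is None else shared & genes
    return sorted(shared or [])


def merge_datasets(first, second):
    """Stack the samples of both datasets, keeping only the shared genes."""
    genes = common_genes(first, second)
    merged = {}
    for dataset in (first, second):
        for sample_id, profile in dataset.items():
            merged[sample_id] = {gene: profile[gene] for gene in genes}
    return genes, merged


# Toy stand-ins for GSE63060 and GSE63061 (hypothetical values).
batch_1 = {"s1": {"GENE_A": 7.1, "GENE_B": 5.0, "GENE_C": 6.2}}
batch_2 = {"s2": {"GENE_A": 6.8, "GENE_C": 6.0, "GENE_D": 4.4}}
genes, merged = merge_datasets(batch_1, batch_2)
```

Restricting to shared genes keeps every sample described by the same columns, which is what a sample-by-gene matrix for a classifier requires.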


3.2. Feature cleaning/selection of features for reducing redundant data 

Feature selection/dimensionality reduction is a common technique for improving model
accuracy or boosting performance on very large datasets. Feature selection must be used in this
scenario for the following reasons:
— For some machine learning algorithms, training on all 16,382 features may take too long.
— Simpler models are easier for researchers and users to interpret.
— Reduced overfitting improves generalization (formally, a reduction of variance).
— Many features in the data could be redundant (highly correlated, linearly dependent) or irrelevant.
These features can be removed without causing significant information loss.

Entropy measures the impurity of a collection of samples. Information gain is the expected
decrease in entropy obtained by splitting the samples on a feature. It is a method of determining the
connection between inputs (samples on the X axis) and outputs (entropy on the Y axis). As a result, the
greater the information gain, the better, as shown in Figure 2 [38].

Figure 2. Information gain bar chart

After applying the feature cleaning/selection procedure with IG, 44 significant features were
retained. This represents a significant reduction in dimensionality from the original 16,382 features.
These 44 features are utilized to create the prediction model.
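
To make the entropy and information gain definitions concrete, the following is a minimal sketch, not the WEKA implementation used in the study. Splitting each numeric feature at its mean is an assumption made for brevity, and the gene names and values are invented.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting a numeric feature at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing gains nothing
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted


def rank_features(columns, labels, top_k):
    """Rank features by IG, splitting each at its mean (an assumption)."""
    scores = {}
    for name, values in columns.items():
        scores[name] = information_gain(values, labels, sum(values) / len(values))
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


labels = ["AD", "AD", "NC", "NC"]
columns = {
    "GENE_A": [9.0, 8.5, 2.0, 1.5],  # separates the two classes perfectly
    "GENE_B": [5.0, 1.0, 5.0, 1.0],  # carries no class information
}
top = rank_features(columns, labels, top_k=1)
```

In the study the same idea is applied with `top_k=44` over the 16,382 gene columns; here the toy data keeps only the single most informative gene.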


3.4. Machine learning algorithms for model building 

Subjects are classified using machine learning algorithms based on their related properties. In this
research, three well-known ML algorithms, specifically Naive Bayes (NB), support vector machine (SVM),
and K-NN, were used to categorize people into impaired and healthy control groups based on the selected
attributes. The prediction ability of each model was assessed and compared using several statistical
measures. The following is a brief overview of the machine learning approaches used in the current study
to create a classification model that can tell the difference between MCI sufferers and healthy people:


3.4.1. Naive Bayes (NB)

The Naive Bayes (NB) approach is founded on the premise that the predictive features
(X1, X2, ..., Xn) of the training dataset are conditionally independent given the class. The NB method
uses Bayes' theorem to classify subjects in the test dataset: it calculates the probability that a subject
belongs to each of the given classes. First, the prior probability of a class is determined from prior
experience, according to the Bayes rule. Second, the likelihood of a subject's traits given a class is
determined by the percentage of subjects in that class with similar traits. In NB analysis, a subject's
final classification is established by multiplying the prior and the likelihood to obtain a posterior
probability, and the subject is assigned to the class with the highest posterior probability [36]. The NB
algorithm can be described as follows. Suppose that a person "X" with characteristics X = <x1, ..., xn>
may belong to the impaired class, denoted "h1", or to the other class, denoted "h2". The posterior
probability of the impaired class is represented as follows:


$P(h_1 \mid x_i) = \frac{P(x_i \mid h_1)\,P(h_1)}{P(x_i \mid h_1)\,P(h_1) + P(x_i \mid h_2)\,P(h_2)}$ (1)


P(h1) is the prior probability associated with class AZ, and P(h1|xi) is the posterior probability. With
"n" different hypotheses, the posterior takes the form shown in (2).


$P(h_1 \mid x_i) = \frac{P(x_i \mid h_1)\,P(h_1)}{P(x_i)}$ (2)

As a result:

$P(x_i) = \sum_{j=1}^{n} P(x_i \mid h_j)\,P(h_j)$ (3)
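
As an illustration of the Bayes rule used by NB, the sketch below computes the posterior of each hypothesis from a prior and a likelihood, with the likelihood formed as a product of per-feature terms (the conditional independence assumption). The probabilities are invented for the example and are not taken from the study.

```python
import math


def naive_likelihood(per_feature_probs):
    """P(x | h) as a product of per-feature probabilities (independence assumption)."""
    return math.prod(per_feature_probs)


def posterior(likelihoods, priors):
    """Posterior P(h_j | x) for every hypothesis h_j via Bayes' rule."""
    # Evidence P(x): sum over all hypotheses of likelihood times prior.
    evidence = sum(likelihoods[h] * priors[h] for h in priors)
    return {h: likelihoods[h] * priors[h] / evidence for h in priors}


# Hypothetical numbers: h1 = impaired class, h2 = healthy class.
priors = {"h1": 0.5, "h2": 0.5}
likelihoods = {
    "h1": naive_likelihood([0.8, 0.5]),
    "h2": naive_likelihood([0.2, 0.5]),
}
post = posterior(likelihoods, priors)
predicted = max(post, key=post.get)  # class with the highest posterior
```

With these toy numbers the subject is assigned to "h1", since its posterior is the larger of the two and the posteriors sum to one.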


3.4.2. Support vector machine (SVM) 

SVM is a supervised machine learning technique that can tackle problems in classification and
regression. Support vectors are simply the coordinates of individual observations, and the SVM classifier
is the frontier (hyperplane) that best segregates the two classes [37]. After solving a convex
optimization problem, the decision function of the SVM is stated as:


$f(x) = \operatorname{sign}(w^T x + b)$ (4)


where w is the weight vector and b is the bias.
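
The linear decision rule of an SVM, the sign of the weighted sum plus bias, can be sketched as follows. The weights and bias are hypothetical values chosen for illustration; the convex optimization that would learn them is omitted, and assigning a score of exactly zero to the positive class is an assumption.

```python
def decision_score(w, x, b):
    """Signed score w^T x + b for one sample."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, x, b):
    """Class label in {-1, +1}: the sign of the decision score."""
    return 1 if decision_score(w, x, b) >= 0 else -1


w = [2.0, -1.0]  # hypothetical learned weight vector
b = -0.5         # hypothetical learned bias

label_a = predict(w, [1.0, 0.5], b)  # score = 2.0 - 0.5 - 0.5 = 1.0
label_b = predict(w, [0.0, 1.0], b)  # score = -1.0 - 0.5 = -1.5
```

The magnitude of the score reflects how far a sample lies from the separating hyperplane, while only its sign determines the predicted class.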

Pros: in high-dimensional spaces, SVM is one of the most efficient and effective supervised machine
learning algorithms. A clearer margin of separation between the classes results in better prediction. It is
especially useful when the number of dimensions exceeds the number of samples [38], which makes it well
suited to the gene expression dataset (16,382 dimensions, 569 samples). In addition, it
is a very memory-efficient algorithm [39]. Cons: computation time is usually longer than for other machine
learning algorithms. It also performs poorly with noisy data, such as data with many highly correlated
features [40]. The SVM classifier is shown in Figure 3.

Figure 3. SVM and TWSVM classifiers are shown on a graph [39]


3.4.3. K-nearest neighbors (K-NN)

The K-NN approach is a straightforward data mining tool that may be used to solve both
classification and regression problems. The K-NN classification algorithm assigns an object to a class
based on the majority class among its K nearest neighbors. The number of neighbors considered in the
vote is defined by the value of K, which is a positive integer. The value of K in this analysis is 11, which
was chosen by trial and error [40]. The K-NN classifier is shown in Figure 4.

Figure 4. KNN classifier [41]
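
The majority vote behind K-NN can be sketched in a few lines. Euclidean distance is assumed here because the paper does not name the distance measure, and the toy data uses a small K rather than the K=11 of the study because it contains only five points.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Label `query` by majority vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance (assumed)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Invented two-feature samples for illustration.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
labels = ["NC", "NC", "NC", "AD", "AD"]

near_controls = knn_predict(points, labels, (1.1, 1.0), k=3)
near_patients = knn_predict(points, labels, (5.1, 5.0), k=3)
```

An odd K, such as the 11 used in the study, avoids tied votes in a two-class problem.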


4. RESULT ANALYSIS/DATA PREPROCESSING


To properly feed data to each machine learning algorithm, data preprocessing is essential.
Furthermore, the quality of the data has a significant impact on the performance of the machine learning
algorithm [41]. Because the original format of the gene expression dataset is not suited to the algorithms,
a data transformation was performed. The dataset is originally formatted as rows (gene expression) x
columns (individuals), but it should be formatted as rows (individuals) x columns (gene expression).
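
The row/column transformation described above amounts to a matrix transpose. The sketch below uses invented gene and subject identifiers and is only meant to show the reshaping step, not the authors' actual preprocessing script.

```python
def transpose(matrix):
    """Swap rows and columns of a rectangular list of lists."""
    return [list(column) for column in zip(*matrix)]


# Original layout: one row per gene, one column per individual (toy values).
gene_ids = ["GENE_A", "GENE_B", "GENE_C"]
subject_ids = ["subj_1", "subj_2"]
genes_by_subjects = [
    [7.1, 6.8],
    [5.0, 5.3],
    [6.2, 6.0],
]

# Layout expected by the classifiers: one row per individual, one column per gene.
subjects_by_genes = transpose(genes_by_subjects)
```

After the transpose, each row is the feature vector of one individual, to which a class label (AD, MCI, or CTL) can be attached.
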
The classification performance of the classifiers during the validation and testing period is shown in Table 1. 


Table 1. Performances of three proposed models


Classifier     Accuracy
K-NN           89%
Naive Bayes    85%
SVM            96%


Evaluation of performance: Over-fitting occurs when a model's parameters are learned and tested
on the same dataset, resulting in near-flawless accuracy on seen data but considerably erroneous
results on unseen data. Cross-validation (CV) is used in this work to avoid over-fitting, while
accuracy and area under the curve are used to quantify performance. Accuracy: each prediction yields two
values: one is the probability of having the disease, which is class '1', and the other is the probability
of not having the disease, which is class '0'. The accuracy is based on a 0.1 threshold, which means that
if the value reaches 0.1, it is a "1", else it is a "0". The cost of a false positive and a false negative is
assumed to be equal; hence the threshold is set at 0.1.

Area under the curve (AUC): AUC is an accuracy measure based on the receiver operating
characteristic (ROC) curve, which shows the tradeoff between sensitivity and specificity. Sensitivity and
specificity have an inverse relationship (increasing sensitivity results in decreasing specificity). The
optimal ROC curve rises from the bottom left through the top-left corner to the top right of the ROC
space, giving an AUC of 1. A ROC curve along the 45-degree diagonal has an AUC of 0.5, that is, 50%
predictive power, which corresponds to random classification.
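
The AUC can be computed without drawing the curve, using its equivalent interpretation as the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The sketch below uses invented scores; it is not the evaluation code of the study, which relied on WEKA.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


def sensitivity_specificity(scores, labels, threshold):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


labels = [1, 1, 0, 0]
perfect = roc_auc([0.9, 0.8, 0.4, 0.3], labels)   # every positive outranks every negative
partial = roc_auc([0.9, 0.4, 0.6, 0.3], labels)   # one of four pairs is misordered
chance = roc_auc([0.5, 0.5, 0.5, 0.5], labels)    # uninformative scores
```

Sweeping the threshold passed to `sensitivity_specificity` traces out the ROC curve point by point, which makes the sensitivity/specificity tradeoff visible.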

Cross validation (CV): To test the model, k-fold cross-validation (CV) divides the training set into k
smaller subsets (folds). In each iteration, one fold is held out as the validation set and the rest of the
data is used to train the model; the procedure then switches to the next fold, and so on. The result is
the average of the k scores.

— Each model in this study is trained with k=5.
— The held-out fold is used to compute the performance measure, which in this case is accuracy.
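
The k-fold procedure above can be sketched as follows. The fold assignment by shuffling, the fixed seed, and the majority-class stand-in classifier are all assumptions for illustration; any classifier exposing a train and a predict function could be substituted.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(samples, labels, k, train_fn, predict_fn):
    """Mean validation accuracy over k folds."""
    accuracies = []
    for held_out in k_fold_indices(len(samples), k):
        held = set(held_out)
        train_idx = [i for i in range(len(samples)) if i not in held]
        model = train_fn([samples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, samples[i]) == labels[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k


# Hypothetical stand-in classifier: always predicts the majority training label.
def train_majority(train_samples, train_labels):
    return max(set(train_labels), key=train_labels.count)


def predict_majority(model, sample):
    return model


folds = k_fold_indices(569, 5)  # 569 samples, k=5 as in the study
toy_accuracy = cross_validate(list(range(10)), ["AD"] * 10, 5,
                              train_majority, predict_majority)
```

Because every sample is held out exactly once, the averaged score uses all of the data for validation without ever testing a model on samples it was trained on.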


5. DISCUSSION

All the classification models in this study were trained on the training dataset with 5-fold
cross-validation. To avoid over-fitting, cross-validation is performed using the training dataset only. The
classifiers’ performance is then assessed using an unseen test dataset. The IG algorithm was also used to
choose the most discriminative features that would most help in the diagnosis of early-stage AD. To
identify the most significant features, the SVM classifier was applied to the data, as shown in Figure 5.

Figure 5. Applying SVM classifier to select features


The accuracy of the results varies depending on the number of features employed. With 44 features,
the model achieved its best accuracy of 96.6%. When the threshold term is lowered (i.e., set to less than
125 mean) in this feature selection approach, the mean cross-validation accuracy improves as the number of
features increases.


6. CONCLUSION 

We proposed an approach to improve AD diagnosis prediction by incorporating feature selection
into the predictor. We applied three machine learning classifiers (SVM, K-NN, and Naive Bayes) together
with a feature selection method, reported the correctly and incorrectly classified instances, and
identified the most appropriate features for SVM model training. 5-fold cross-validation was used in this
study, and a low variance of prediction error was attained, demonstrating the robustness of our method.
In medicine and healthcare studies, machine learning and data mining techniques are particularly useful
for the early identification and diagnosis of a variety of disorders. The biggest advantage of our method
over previous feature selection models is that the training system automatically supplies the features
required for a better prediction. The best classification result was obtained by SVM with information gain
feature selection (96.6% accuracy). Moreover, past research has linked several of the features chosen by
our technique to Alzheimer's disease or other psychological disorders, supporting the model's efficacy.
Furthermore, the findings demonstrated that machine learning and data mining approaches can be utilized to
accurately detect, predict, and diagnose a variety of diseases. In future work, the number of examples for
the AD and NC classes should be increased so that the model can be trained with sufficient and balanced
data for all classes, increasing the accuracy of AD stage classification.